Candidate Document Retrieval for Web-Scale Text Reuse Detection
نویسندگان
چکیده
Given a document d, the task of text reuse detection is to find those passages in d which in identical or paraphrased form also appear in other documents. To solve this problem at web-scale, keywords representing d’s topics have to be combined to web queries. The retrieved web documents can then be delivered to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents’ quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].
منابع مشابه
Approaches for Source Retrieval and Text Alignment of Plagiarism Detection Notebook for PAN at CLEF 2013
In this paper, we describe our approach at the PAN@CLEF2013 plagiarism detection competition. In sub-task of Source Retrieval, a method combined TF-IDF, PatTree and Weighted TF-IDF to extract the keywords of suspicious documents as queries to retrieve the plagiarism source document is proposed. In sub-task of Text Alignment, a method based on sentence similarity is presented. Our text alignment...
متن کاملAutomatic In-Text Keyword Tagging based on Information Retrieval
As shown in Wikipedia, tagging or cross-linking through major keywords in a document collection improves not only the readability of documents but also responsive and adaptive navigation among related documents. In recent years, the Semantic Web has increased the importance of social tagging as a key feature of the Web 2.0 and, as its crucial phenotype, Tag Cloud has emerged to the public. In t...
متن کاملApproaches for Candidate Document Retrieval and Detailed Comparison of Plagiarism Detection
In this paper we report on our plagiarism detection system which is used to process the PAN plagiarism corpus for the tasks of Candidate Document Retrieval and Detailed Comparison. To retrieve the plagiarism candidate document by using ChatNoir API, a method based on tf*idf to extract the keywords of suspicious documents as queries is proposed. An Lucene ranking method is used for plagiarism ca...
متن کاملA New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
متن کاملDetection of Text with Connected Component Clustering
Text detection and recognition is a hot topic for researchers in the field of image processing. It gives attention to Content based Image Retrieval (CBIR) community in order to fill the semantic gap between low level and high level features. Several methods have been developed for text detection and extraction that achieve reasonable accuracy for natural scene text (camera images) as well as mu...
متن کامل